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Abstract 

Principal components analysis (PCA) is the optimal linear auto-encoder of data, and it is often used to 
construct features. Enforcing sparsity on the principal components can promote better generalization, 
while improving the interpretability of the features. We study the problem of constructing optimal sparse 
linear auto-encoders. Two natural questions in such a setting are: 

(i) Given a level of sparsity, what is the best approximation to PCA that can be achieved? 

(ii) Are there low-order polynomial-time algorithms which can asymptotically achieve this optimal trade¬ 
off between the sparsity and the approximation quality? 

In this work, we answer both questions by giving efficient low-order polynomial-time algorithms for con¬ 
structing asymptotically optimal linear auto-encoders (in particular, sparse features with near-PCA recon¬ 
struction error) and demonstrate the performance of our algorithms on real data. 


1 Introduction 


An auto-encoder transforms (encodes) the data into a low dimensional space (the feature space) and then 
lifts (decodes) it back to the original space. The auto-encoder reconstructs the data through a bottleneck, 
and if the reconstruction is close to the original data, then the encoder was able to preserve most of the 
information using just a small number of features. Auto-encoders are important in machine learning be¬ 
cause they perform information preserving dimension reduction. The decoder only plays a minor role 
in verifying that the encoder didn't lose much information. It is the encoder that is important and con¬ 
structs the (useful) low-dimensio nal feature vector. Nonlinear auto-encoders played an important role in 
auto-associative neural networks 


1988: 

Oia. 

19911. 

maps ioja 

19921 


icottrell 


and Mrmro 


19881 Baldi and Hornik . 19881: Bourlard and Kamp , 


Oia , 1991 1- A special case is the linear auto-encoder in which both the decoder and encoder are linear 


199211 . Perhaps the most famous linear auto-encoder is principal components analysis (PCA), 


in particular because it is optimal: PCA is the linear auto-encoder that preserves the maximum amoimt of 
information (given the dimensionality of the feature space). We study general linear auto-encoders, and 
enforce sparsity on the encoding linear map on the grormds that a sparse encoder is easier to interpret. A 
special case of a sparse linear encoder is sparse PCA. 

More formally, the data matrix is X s (each row xj G is a data point in d dimensions). Our 
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focus is the linear auto-encoder, which, for fc < d, is a pair of linear mappings 


and 

^ R^, 

specified by an encoder matrix H G and a decoder matrix G G For data point x G the 

encoded feature is 

Z = h(x) = H^x e R'' 

and the reconstructed datum is 

X = g(z) = G^z G R'^. 

Using X G R”^'^ to denote the reconstructed data matrix, we have 

X = XHG. 


The pair (H, G) are a good auto-encoder if X w X, using the squared loss: 

Definition 1 (Information Loss ^(H, X)). The information loss of linear encoder H is the minimum possible recon¬ 
struction error for X over all linear decoders G; 

t(H,X) ||X-XHG||p= ||X - XH(XH)tx||p . 

The last formula follows by doing a linear regression to obtain the optimal G 

PGA is perhaps the most famous linear auto-encoder, because it is optimal with respect to information 
loss. Since rarLk(XHG) < k, the information loss is bounded by 

t(H,X) > ||X-Xfc||^, 

(for a matrix A, A^ is its best rank-A: approximation). By the Eckart-Yormg theorem X^ = XVfcV^, where 



and feature extraction. While PGA simplifies the data by concentrating as much information as possible into 
a few components, those components may not be easy to interpret. In many applications, it is desirable to 
"explain" the PCA-features using a small number of the original dimensions because the original variables 
have direct physical significance. For example, in biological applications, they may be genes, or in financial 
applications they may be assets. One seeks a tradeoff between the fidelity of the features (their ability to 
reconstruct the data), and the interpretability of the features using a few original variables. We would like 
the encoder H to be sparse. Towards this end, we introduce a sparsity parameter r and require that every 
column of H have at most r non-zero elements. Every feature in an r-sparse encoding can be "explained" 
using at most r original features. Such interpretable factors are known as sparse principal components (SPCA). 
We may now formally state the sparse linear encoder problem that we consider in this work: 
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Problem: Optimal r-sparse encoder (Sparse PCA) 

Given X G e > 0 and k < rank(A), find, for fhe smallesf possible r, an r-sparse 

encoder H for which 

_ f(H, X) = ||X - XH(XH)tX||F < (1 + £)||X - Xfcll^. _ 

Note that we seek a relative-error approximation to the optimal loss. 


1.1 Notation 


Let p < min{n, d] = rank(X) (typically p = d). We use A, B, C,... for matrices and a, b, c,... for vectors. 
The standard Euclidean basis vectors are ei, 62 ,... (the dimension will usually be clear from the context). 
The singular value decomposition (SVD) allows us to write X = USV’^, where the columns of U G R”^^ 


OXp 


IS a 


are the p left singular vectors, the columns of V G R“ ^ are the p right singular vectors, and E G 
diagonal matrix of positive singular values cti > • • • > <Tp', U and V are orthonormal, so U^U = = Ip 


Golub and Van Loan il996ll . For integer k, we use U 


h G 


pnx/c 


(resp. Vfc G R“^'"') for the first k left (resp. 
right) singular vectors, and G R^^^ is the diagonal matrix of corresponding top-fc singular values. We 
can view a matrix as a row of coluirms. So, X = [fi,..., f,;], U = [ui,..., Up], V = [vi,..., Vp], Ufc = 
[ui,..., Ufc] and Vfc = [vi,..., v^]. We use f for the columns of X, the features, and we reserve for the data 
points (rows of X), X^ = [xi,..., x„]. We say that matrix A = [ai,..., a^] is (ri,..., rfe)-sparse if Ija^||p < 
moreover, if all ri are equal to r, we say the matrix is r-sparse. 

The Frobenius (Euclidean) norm of a matrix A is || A||p = Jf-- = Tr(A’^A) = Tr(AA^). The pseudo¬ 


inverse A^ of A with SVD UaEaVa is A' = VaE^ Ua; AA^ = UaUa is a symmetric projection operator. 
For matrices A, B with A^ B = 0, a generalized Pythagoras theorem holds, ||A-|-B||p = ||A||p-|-||B||p. ||A ||2 
is the operator norm (top singular value) of A. 


1.2 Outline of Results 

Polynomial-time Algorithm for Near-Optimal r-Sparse Encoder. We give the first pol 5 momial-time al¬ 
gorithms to construct an r-sparse linear encoder H and a guarantee that the information loss is close to the 
optimal information loss of PCA (see Theorem[7ll, 

t(H,X) < (l + 0(A:/r))||X-Xfc||^. 

Setting r = 0(k/e) gives a (1 -|-e)-approximation to PCA. Our algorithm is efficient, rurming in 0{ndr + {n-\- 
d)r^) time (low-order polynomial). We know of no other result that provides a guarantee on the quality of 
a top-fc sparse linear encoder with respect to the optimal linear encoder (PCA). Our algorithm constructs 
all k factors simultaneously, using a blackbox reduction to column subset selection (see Theorem|5). 

Lower Bound on Sparsity to Achieve Near-Optimal Performance. We give the first lower hound on the 
sparsity r that is required to achieve a (1 -l-e)-approximation to the irrformation loss of PCA (see Theorem|9]l. 
The lower bound shows that sparsity r = il{k/e) is required to guarantee (1 -I- e)-approximate information 
loss with respect to PCA, and hence that our algorithm is asymptotically optimal. 
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Iterative Algorithm for Linear Encoder. Our algorithm constructs all k factors simultaneously, in a sense 
treating all the factors equally. One carmot identify a "top" factor. Most existing sparse PCA algorithms 
first construct a top sparse PCA factor; then, every next sparse factor must be orthogonal to all previous 
ones. We develop a similar iterative algorithm by rurming our "batch" algorithm repeatedly with fc = 1, 
each time extracting a provably accurate sparse factor for a residual. We give a performance guarantee for 
the resulting k factors that are constructed by our algorithm (see Theorem [TT]|. We show that in each step, 
all factors constructed up to that point are provably accurate. Ours is the first performance guarantee for 
any iterative scheme of this type, that constructs sparse factors one by one. 

Our information loss guarantee for the iterative algorithm is approximately a (1 + 0(£:logfc))-factor 
worse than PCA (up to a small additive error). This bound is not as good as for the batch algorithm, but, in 
practice, the iterative algorithm is faster, constructs sparser factors and performs well. 

Experiments. We show the experimental performance of our algorithms on standard benchmark data sets 
and compared with some standard benchmark sparse PCA algorithms. The experimental results indicate 
that our algorithms perform as the theory predicts, and in all cases produces factors which are comparable 
or better to the standard benchmark algorithms. 


1.3 Discussion of Related Work 

PCA is the most popular linear auto-encode r, due to its optimality. Nonlinear auto-encoders became promi- 
nent with auto-associative neural networks 


1988; 

Oja. 

1991, 

19921 


icottrell 


and Mimrol.ll988l:lBaldi and Hornikl.ll988HBourlard and Kamp 


199211 . We are unaware of work addressing the more gereral "sparse linear auto-encoder' 


However, there is a lot of research on "sparse PCA", a special case of a sparse linear auto-encoder. 


Sparse Factors. The importance of sparse fac tors in dimensionality reduction was recognized in some 
early work: the varimax criterion of Kaiser il958ll was used to rotate the factors and encour age spars i ty, and 

this has been used in multi-dimensional scaling approaches to dimension reduction by |SammOT J 19^ [j _ 

Kruskal il964ll . One of the first attempts at sparse PCA used axis rotations and thresholding Icadima and Tolliffel 
199511 . Since then, sophisticated computational techniques have been developed. In general, these methods 
address finding just one sparse principal component, and one can apply the algorithm iteratively on the 
residual after projection to get additional sparse principal components. 


Minimizing Information Loss versus Maximizing Variance. The traditional formulation of sparse PCA 
is as a cardinality constrained variance maximization problem: maximize v^Av subject to v^v = 1 and 
||v||g < r, for A S and r < n. A straightforward reduction from MAX-CLIQUE shows this problem to 

be NP-hard (if A is the adjacency matrix of a graph , then A has a clique of size k if and only if v^Av > (fc— 1) 
for a fc-sparse unit vector v, see iMagdon-Ismaill ll2015lh . Sparse PCA is a special case of a generalized 


eigenvalue problem: maximize v^Sv subject to v^Qv = 1 and ||v||p < r. This generalized eigenvalue 


NP-hard 


Natarajan il99.4 : 


Foster et al. 


hoiA 


problem is known to be NP-hard Mog, haddam et al\ 11200811 . via a reduction from sparse regression which is 


This view of PCA as the projection which maximizes variance is due to a historical restriction to sym¬ 
metric auto-encoders (so G = H^), see for example Oja il992ll . The PCA auto-encoder is symmetric 
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because = V^. Observe that 


mr(X) = ||X||^ = ||X(I-HHt) + XHHt||p 

= ||X-XHHt||p + ||XHHt||p, 


where the last equality is from Pythagoras' theorem. Minimizing the information loss ||X — XHH^||p is 
equivalent to maximizing ||XHH^||p = Tr(H^X^XH), the symmetric explained variance. (A similar decompo¬ 
sition holds for the information loss of a general linear aufo-encoder and fhe "frue" explained variance). 
The fop-fc principal components V k maximize the symmetric explained variance. This view of PCA as the 
projection that captures the maximum symmetric explained variance has led to the historical approach to 
sparse PCA: find a symmefric autoencoder that is sparse and captures the maximum symmetric explained 
variance. The decomposition of fhe variance into a sum of information loss and symmetric explained vari¬ 
ance means that in an unconstrained setting, minimizing information loss and maximizing the symmetric 
explained variance are both ways of encouraging H to be close to V^. However, when H is constrained (for 
example to be sparse), these optimization objectives can produce very different optimal solutions, and there 
are several reasons to focus on minimizing information loss: 


(i) Maximizing variance corresponds to minimizing information loss for symmetric decoders. For the 
general encoder however, variance has no intrinsic value, but information loss directly captures the 
unsupervised machine learning goal: the decoder is secondary and what matters is that the encoder 
produce a compact representation of fhe dafa and preserve as much information as possible. Consfrainfs 
on the decoder (such requiring a symmetric auto-encoder) will translate into a suboptimal encoder 
that loses more information than is neccessary. We are after an encoder into a lower dimensional space 
that preserves as much information in the data as possible. Hence, to get the most informative sparse 
features, one should directly minimize the information loss (placing no constraints on the decoder). 


(ii) An approximation algorithm for information loss can be converted to an approximation algorithm for 
variance maximization. 


Theorem 2. If 


X 


XHH^ 


' <(1 + £)||X 


Xfellp, then 


XHHt 



e||X-Xfc||^ > (l 



2 

F' 


Proof. Since 


X - XHH^ 


= l|X||^- 


XHH 


<{l + e) ||X - Xfc lip, we have 


XHHt 


> l|X||j 


(l + e)||X-Xfc||2 = ||Xfc||^-e||X-Xfc||^. 


The second inequality follows from fhe bound ||X 


Xa 


If< 


\Xk\\l-{p-k)/k. 


Thus, a relative error approximation for reconsfrucfion error gives a relative error approximation for 
explained variance. However, a decoder which gives a relafive error approximafion for symmetric 
explained variance does not immediately give a relative error approximation for information loss. 


(iii) Ex plained variance is not well defined for general encoders, whereas information loss is well defined. 


As 


Zou et al. 


1 20061] poinfs ouf, one has to be careful when defining fhe explained variance for gen¬ 
eral encoders. The interpretation of Tr(H^X^XH) as an explained variance relies on two important 
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properties: the columns in H are orthonormal ("independent") directions and they are decorrelated 
from each other, that is = 1^ and H^X^XH is diagonal. The right singular vectors are the unique 
factors having these two properties. Therefore, when one introduces a cardinality constraint, one has 
to give up one or both of these properties, and typically one relaxes the decorrelation requirement. 
Now, as 

computes a 


Zou et al. 


1200611 


define, and Tr(H^X^XH) is an optipistic estimate of explainied variance. 

"variance after decorrelation" to quantify the quality of sparse PCA. Their solution is not completely 
satisfactory since the order in which factors are sequentially decorrelated can change the explained 
variance. Our solution is simple: use the information loss which is not only the natural metric one is 
interested in, but is always well defined. 


Heuristics for Sparse PCA via Maximizing Variance. Despite the complications with with interpreting 
the explained variance with general encoders which produce correlated factors, all the heuristics we are 
aware of has focussed on maximizing the variance. With a sparsity constraint, the problem becomes com¬ 
binatorial and the exhaustive algorithm requires 0{dr^{ f)) computati on. This exponent ial running time 


can be improved to for a rank-g perturbation of the identity lAsteris et all 12011 1. None of these 


exhau stive algorithms are practical for high-dimensional data. Several heuristics exist 


1200.31 and 


Zou et al. 


1 200611 take an Li penalization view. DSPCA (direct sparse PCAl Id'Aspremont et al 


Trendafilov et al 


1 200711 a lso uses an Li sparsifier but solves a relaxed convex semidefinite program which is further re¬ 
fined in d'Aspremont et al. I i2008ll where they also tractable sufficient condition for testi ng optimality. The 


simpl est algorithms use greedy forward and backward subset selection. For example, iMoghaddam et al 


1 200611 develop a greedy branch and bound algorithm based on spectral bounds with 0{(P) running time 
for forward selection and O(d^) rurming time for backwar d selection. An alternat ive view of the problem 
is as a sparse matrix reconstruction problem; for example IShen and Huangl 11200811 obtain sparse principal 
components using regularized low-rank matrix approximation. 


Theoretical guarantees. The heuristics are quite mature, but few theoretical guaran tees are known. W e 
are not aware of any polynomial time algorithms with guarantees on optimality. In 


Asteris et al 


torn . 


the sparse PCA problem is considered with an additional non-negativity constraint; and the authors give 
an algorithm which takes as input a parameter k, has running time logd -I- d^r^) and constructs a 

sparse solution that is a (1 — ^||S — Sfc|| 2 /||S|| 2 )-factor from optimal. The running time is not practical when 
k is large and the approximation guarantee is only non-trivial when the spectrum of S is rapidly decaying. 
Further, the approximation guarantee only applies to constructing the first sparse PCA and it is not clear 
how this result can be extended if the algorithm is applied iteratively. 

To our knowledge, there are no known guarantees for top-fc sparse factors with respect to PCA Within 
the more general setting of sparse linear auto-encoders, we solve two open problems: (i) We determine the 
best approximation guarantee with respect to the optimal linear encoder (PCA) that can be achieved using a 
linear encoder with sparsity r; (ii) We give low order polynomial algorithms that achieve an approximation 
guarantee with respect to PCA using a sparsity that is within a factor of two of the minimum required 
sparsity. Our algorithms apply to constructing k sparse linear features with performance comparable to 
top-fc PCA. 
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2 Optimal Sparse Linear Encoder 


We show a black-box reduction of sparse linear encoding to column subset selection, and so it is worthwhile 
to first discuss background in column subset selection. We then use column subset selection algorithms to 
construct provably accurate sparse auto-encoders. Finally, we modify our main algorithm so that it can 
select features iteratively and prove a bound for this iterative setting. 


2.1 Column Subset Selection Problem (CSSP) 

For X = [fi,..., f,i], we let C = [hj, ..., denote a matrix formed using r columns "sampled" from X, 

where 1 < ii < i 2 ■ ■ ■ < ir ^ d are disfincf column indices. Column sampling is a linear operafion, and we 
can use a matrix 17 € to perform the column sampling. 


C = X17, where 17 = 


and Gi are the standard basis vectors in (post-multiplying X by "samples" the ith column of X). The 
columns of C span a subspace in the range of X (which is fhe span of fhe columns of X). Any column 
sampling mafrix can be used fo consfrucf an r-sparse matrix. 

Lemma 3. Let 17 = . • ■, and let A £ be any matrix. Then 17A £ is r-sparse, i.e., 

each column of LI A has at most r non-zero entries. 


Proof. Let A = [ai,... ,afe] and consider Llaj which is column j of 17A. A zero-row of 17 results in the 
corresponding entry in Llaj being zero. Since 17 has r non-zero rows, there are at most r non-zero entries in 
Llaj. ■ 


Given columns C, we define Xc = CC^X to be the matrix obtained by projecting all the columns of 
X onfo fhe subspace spanned by C. For any matrix X whose columns are in the span of C, ||X — Xc||p < 
||X — X||p. Let Xc,fc £ be the optimal rank-fc approximation to Xc obtained via the SVD of Xc. 


Boutsidis et al 


koli i. Xc,fc is a rank-k matrix whose columns are in the span ofC. 


Lemma 4 (See, for example. 

Let X be any rank-k matrix whose columns are in the span ofC. Then, ||X — Xc,fe||p < ||X — X||j 


That is, Xc,fc is the best rank-fc approx imation to X whose columns are in the span of C. An efficient 


algorithm to compute Xc,fe is also given in 
time. We reproduce that algorithm here. 


Boutsidis et al. 


1 201411 . The algorithm runs in 0{ndr -i (n -i d)r^) 


Algorithm to compute Xc,fc 

Inputs: X £ R”'''', C £ R"''’’, k<r. 

Output: Xc.fc £ R"'"''. 

1: Compute a QR-factorization of C as C = QR, with Q £ R”^’', R £ R’'^’'. 

2: Compute (Q’^X)fc £ R’'^'^ via SVD (the best rank-fc approximation to Q^X). 
3: Return Xc.fc = Q(Q^X)fe £ R"^'^. 


We mentio n that it is possible to com pute a (1 -i e)-approximation to Xc,fe more quickly using randomized 
projections iBoutsidis and Woodruff . section 3.5.2]. 
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2.2 Sparse Linear Encoders from CSSP 

The main result of this section is to show that if we can obtain a set of columns C for which Xc,fc is a good 
approximation to X, then we can get a good sparse linear encoder for X. We first give the algorithm for 
obtaining an r-sparse linear encoder from r-columns C, and then we give the approximation guarantee. 
For simplicity, in the algorithm below we assume that C has full column rank. This is not essential, and if 
C is rank deficient, then the algorithm can be modified to first remove the dependent columns in C. 

Blackbox algorithm to compute encoder from CSSP 

Inputs: X G C G with C = xn and n = [e^,,... , 0 ^^; k<r. 

Output: r-sparse linear encoder H G 
1: Compute a QR-factorization of C as C = QR, with Q G R G 
2 : Compute the SVD of R"^(Q^X)fe, R"i(Q^X)fe = UrSrV^, where 
Ur G Sr g R'='''= and Vr g R'^^^ 

3: Return H = UUr G 

In the algorithm above, since C has full column rank, R is invertible. Note that the algorithm can be 
modified to accomodate r < k, in which case it will output fewer than k factors in the output encoding. 
In step 2, even though R“^(Q’^X)fc is an r x c? matrix, it has rank k, hence the dimensions of Ur, Sr, Vr 
depend on k, not r. By Lemma |3l the encoder H produced by the algorithm is r-sparse. Also observe that 
H has orthonormal columns, as is typically desired for an encoder: 

= U^U^UUr = U^Ur = Ife. 

Actually, the encoder H has a much stronger property than r-sparsity, namely that in every column the non¬ 
zeros can only be located at the same r coordinates. We now compute the running time of the algorithm. The 
first two steps are as in the algorithm to compute Xc,fc and so take 0{ndr+{n+d)r^ ) time (the multiplication 
by R~^ and the computation of an additional SVD do not affect the asymptotic running time). The last step 
involves two matrix multiplications which can be done in 0{r^k -I- drk) additional time, which also does 
not affect the asymptotic running time of 0{ndr -I- (n -I- d)r^). We now show that the encoder H produced 
by the algorithm is good if the columns result in a good rank-A: approximation Xc,fc. 

Theorem 5 (Blackbox encoder from CSSP). Given X G C = XU G R"^’' with U = [e^^,..., and 

k < r, let H be the r-sparse linear encoder produced by the algorithm above, which runs in 0{ndr d)r'^) time. 

Then, the information loss satisfies 

£(H,X) = ||X-XH(XH)tX||p < ||X-Xc,fe||^. 

The theorem says that if we can find a set of r columns within which a good rank-fc approximation to X 
exists, then we can construct a good linear sparse encoder. 

Proof. We show that there is a decoder G for which IIX — XHGlip = ||X — Xc,fc||p. The theorem then follows 
because the optimal decoder cannot have higher reconstruction error. We use Ur, Er and Vr as defined in 
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the algorithm, and set G = ErV^. We then have, 

XHG = Xt^URSRV^ 

= xnR-\Q^x)k 
= CR-^\Q^X)k 

The first step is by construction in the algorithm because Ur , Er , Vr is obtained from the S VD ofR~^(Q^X)fc. 
The second step is because C = XU. Since G = QR (and R is invertible), it follows that CR^^ = Q. Hence, 


XHG = Q(Q"X)fe 

= Xc,fc, 

hence ||X — XHG||p = ||X — Xc,fe||p, concluding the proof. ■ 

What remains is to find a sampling matrix U which gives a good s et of columns G = X U for which 
||X ~ '^c,k lip is small. The main tool to obtain G and U was developed in Boutsidis et al. 1 2014 1 which gave 
a constant factor deterministic approximation algorithm and a relative-error randomized approximation 
algorithm. We state a simplified form of the result and then discuss various ways in which this result can 
be enhanced. The main point is that any algorithm to construct a good set of columns can be used as a black 
box to get a sparse linear encoder. 


Theorem 6 (Near-optimal CSSP 


Boutsidis et al. 


I201J). Given X e 


pn X d 


of rank p and target rank k: 


(i) (Theorem 1 in iBoutsidis et alJ iZOlm ) For sparsity parameter r > k, there is a deterministic algorithm which 


runs in time Tv^+0{ndk + dk^) to construct a sampling matrix U = [e^j,..., and corresponding columns 
G = XU such that 


||X-Xc.fc||| < 1 + 


1 


(1 - v^)^ 


IIX-X 


fc IIf' 


(ii) (Simplified Theorem 5 i mBoutsidis et al.\l20l4l ) For sparsity parameter r > 5k, there is a randomized algorithm 
which runs in time 0{ndk + dk^) to construct a sampling matrix U = ,..., and corresponding columns 

C = XU such that 


E 


l|X-Xc,fe| 


< 1 + 


5k 


— 5k 


IIX-X, 


IF- 


Proof. (Sketch for pa rt (ii).) We follow the proof in 


Boutsidis et al 


using the same notation in 


Boutsidis et al. I i2014ll . Ghoose 5k columns for the initial constant factor approximation and set the accu¬ 
racy £o for the approximate SVD to compute V, to a constant so that cq = (1 -I- £o)(l + (1 — ^^1/5)“^) = 5. 
The adaptive sampling step ensures a (1 -I- cqA:/ s)-approximation for the reconstruction, where s = r — 5k. 


Comments. 

1. In part (ii) of the theorem, setting r = 5k + 5k/e = 0{k/e) ensures that Xc,, is a (1 -I- e)-reconstruction 
of Xfc (in expectation). At the expense of an increase in the running time the sparsity can be reduced to 
r ss 2fc/e while still giving a (1 -i e)-reconstruction. The message is that 0(fc/e)-sparsity suffices to get a 
(1 -I- e)-accurate reconstruction. 
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2. Part (ii) of the theorem gives a bound on E[||X — Xc,fe||p]. The expectation is with respect to random 
choices in the algorithm. Using an application of Markov's inequality to the positive random variable 
||X — Xc,fc||p — ||X — Xfcllp, ||X — Xc,fc||p < (1 + 2£:)||X — Xfcllp holds with probability at least i. This can 
be boosted to high-probability, at least 1 — d, for an additional log j factor increase in the rurming time. 

3. In part (i) of the theorem Vfc (the top-fc right singular vectors) are used in the algorithm, and Ty^, is the 
time to compute V^. To compute exactly, the only known algorithm is via the full SVD of X, which 
takes 0{ndmm{n, d}) time. It turns out that an approximate SVD will do. Suppose that Vfc satisfies 

||X-XVfc.vI.||p<a||X-Xfcllp. 


The approximation Vfc can be used in the a lgorithm instead of Vfc and the error will increase by an 

1 201411 gives a n efficient randomized approx imation for Vfc and 


additional factor of 


Boutsidis et al. 


fhere is also a recent deterministic algorithm given in ichashami and Phillips . 2013 , Theorem 3.1] which 
computes such an approximate Vfc in time 0(ndfce“^) with a = 1 + e. 


Boutsidis et al 


1 201411 . We 


4. The details of the algorithm that achieves part (ii) of fhe theorem are given in 
simply state the three main steps. 

(i) Use a randomized projection to compute an approximation Vfc. 

(ii) Apply the algorithm from parf (i) of the theorem using Vfc to get a constant factor approximation. 

(iii) Apply one round of randomized adaptive column samplmg lDeshpande and Vempalal 1120061] which 
boosts the constant factor approximation to a 1 -I- £ approximation. 

This entire algorithm can be de- randomized to give a determ inistic algorithm by using the determin¬ 


istic approximation for V k from Ghashami and Phillip.s J2013 1 and a derandom izafion of fhe adapti 


Boufsidis and Woodruff 


1201 4 I 1 . 


:ive 


The fradeoff will be a 


sampling step which appeared in a recent result 
non-trivi al increase m the rurming time. The entire algorithm can also be made to run in mput-sparsity- 
time f see Boutsidis and Woodruffi for details). 

5. Doubly sparse auto-encoders. It is possible to construct an 0(fc/£)-sparse linear auto-encoder and a 
corresponding "sparse" decoder with the following property: the reconstruction XHG has rows which 
are all linear combinations of 0{k/£) rows of X. Thus, the encoding identifies features which have 
loadings on only a few dimensions, and the decoder for the encoding is based on a small number of data 
points. Further, the reconstruction error is near optimal, ||X — XHG||p <(1+£)||X-Xfcllp. 

We give the rough sketch for this doubly sparse linea r auto-encoder, without ge tting into too many 


details. The main tool is the C$R matrix decomposition 


optimal construction was given in Boutsidis and Woodruf 

and a rank-fc matrix $ € 


j^nxri rows R = U^X e 
£)||X - Xfcllp, where Ui = [e^j, 


‘rq 

/T 


and Uq = [efc 


Drineas and KannanI 11200311 . An asymptotically 
lEU which constructs columns C = XUi € 
]^i'ixr2 £qj. Yvhich ||X — C<l>R||p < (1 -|- 
are sampling matrices with ri = 0{k/£) 


iXk 


and £2 = 0{k/e). Let $ = U^E^Vj be the SVD of 4>, where U^ G E$ G V$ G 

Then, we set the encoder to H = UiU$ which is 0(fc/e)-sparse, and the decoder to G = E^VjU^X. The 
reconstruction is XUiU^E^VjU^X; The matrix U^X contains 0{k/e) rows of X and so the reconstruction 
is based on linear combinations of these rows. 

The trade off for getting this doubly-sparse encoding is a decrease in the accuracy, since the decoder is 
now only getting an approximation to the best rank-fc reconstruction within the column-span of C. The 
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advantage of the doubly sparse encoder is that it identifies a small set of 0{k/£) original features that 
are important for the dimension-reduced features. It also simultaneously identifies a small number of 
0{k/e) data points that are important for reconstructing the entire data matrix. 

We are now ready for the main theoretical result of the section. By using Theorem 0 in our black-box 
linear encoder, we obtain our algorithm to construct a sparse linear encoder with a provable approximation 
guarantee. 

Sparse Linear Encoder Algorithm 

Inputs: X S target rank k < rank(X); sparsity r > k. 

Output: Near-optimal r-sparse linear encoder H € 

1: Use the algorithm from Theorem |6]-(ii) to compute columns C = Xfl € with 

inputs X, k, r. 

2: Return the encoder H computed by using X, C, A: as input to the CSSP-blackbox en¬ 
coder algorithm. 

Using Theorem[6]m Theorem|5l we have an approximation guarantee for our algorithm. 

Theorem 7 (Sparse Linear Encoder). Given X G of rank p, the target number of sparse PCA vectors k < p, 

and sparsity parameter r > 5k, there is a randomized algorithm running in time 0{ndr -|- (n -i d)r^ + dk^) which 
constructs an r-sparse encoder H such that: 


E[£(H,X)=E ||X-XH(XH)^X||j 


< 1 + 


5k 


— 5k 


IX -X 


2 

/clip* 


Comments. 

1. The expectation is over the random choices made in the algorithm to construct H. All the comments 
from Theorem [6] apply here as well. In particular, this result is easily converted to a high probability 
guarantee, or even a deterministic guarantee at the expense of some accuracy and a (pol 5 momial in d, r)- 
factor increase in the computational cost. 

2. The guarantee is with respect to ||X — X^ ||p, which is the best possible rank-fc reconstruction with k dense 
optimal features obtained from PCA. Our result shows that 0(fc/e)-sparse features suffices to mimic top- 
k (dense) PCA within a (1 -t- e)-factor error. 

We presented the simplified version of the results requiring sparsity r > 5k. Actually our technique can 
be applied for any choice oi r > k with approximation guarantee c(l -|-1/(1 — \/k/rY) where c ~ 1 is a 
constant. 

3. The reconstruction error using our r-sparse encoder is compared with PCA, not with the optimal r- 
sparse encoder. With respect to the optimal r-sparse encoder, our approximation could be much better. 
It would be a challenging task to get an approximation guarantee with respect to the best possible r- 
sparse linear-encoder. 

2.3 Lower Bound on Sparsity 

We define the combined sparsity of H to be the number of its rows that are non-zero. When k=l, the combined 

sparsity equals the sparsity of the single factor. The combined sparsity is the total number of dimensions 
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which have non-zero loadings among all the factors. Our algorithm produces an encoder with combined 
sparsity 0{k/e) and comes within (1 + e) of fhe minimum possible information loss. We show that this 
is worst case optimal. Specifically, there is a matrix X for which any linear encoder which can achieve a 
(1 -I- e)-approximafe reconstruction error as compared to PCA must have a combined sparsity r > k/e. So 
n(fc/e)-sparsity is required to achieve a (1 -I- £)-approximation. The common case that is studied in the 
literature is with k = 1 (constructing a sparse top principal component). Our lower bormd shows that 
n(l/e)-sparsity is required to get a (1 -I- e) -approximation and our algorithm asymptotically achieves this 
lower bound, therefore we are asymptotically optimal. 

We show the converse of Theorem namely that from a linear auto-encoder with combined sparsity r, 
we can co nstruct r columns C f or which Xc.fe is a good approximation to X. We then use the lower bormd 


proved in ihoutsidis et al. , 2014 , Section 9.2] which states that for any d > 0, there is a matrix B such that for 


any set of r of its columns C, 


|B - Bc,k\\l > l|B - CC^BIIp >(l+’l-S]\\B-B 


fellp- 


( 1 ) 


Now suppose that H is a linear encoder with combined sparsity r for this matrix B from iBoutsidis et al. 
2014 Section 9.2], and let G be a decoder satisfying 


|B-BHG||^ < (l + e)||B-Bfe||^. 


We offer the following elementary lemma which allows one to construct columns from a sparse encoder. 

Lemma 8. Suppose H is a linear encoder for B with combined sparsity r and with decoder G. Then, BHG = CY for 
some set ofr columns ofB, denoted G, and some matrix Y € 

Proof. Let 1 < ii < *2 ■ • ■ < ir < d he the indices of the non-zero rows of H and let fl = ,..., be 

a sampling matrix for B. Every column of H is in the subspace spanned by the columns in Tt. Hence, the 
projection of H onto the columns in ft gives back H, that is H = So, BHG = BflfP^HG = CY, where 

C = Bfl and Y = fl^HG. ■ 

By LemmalU and because ||B — CC^B||p < ||B — CY||p for any Y, we have: 

||B - CC^BIIp < ||B - CY||^ = ||B - BHGIll < (1 + £)||B - Bfe||^. (2) 

It follows from eq:lowerl and eq:lower2 thaf e > k/r. Our algorithm gives a reconstruction error e = 
0(k/r), and so our algorithm is asymptotically worst-case optimal. The conclusion is that no linear encoder 
with combined sparsity r can achieve an approximation ratio which is smaller than (1 + k/r). 

Theorem 9 (Lower Bound on Sparsity). There exists a data matrix X for which every r-sparse encoder with 
r < k/e has an information loss which is strictly greater than (1 -I- e)||X — Xfc||p. 

This theorem holds for general linear auto-encoders, and so the lower bound also applies to the sym¬ 
metric auto-encoder HH^, the traditional formulation of sparse PCA. For the case k = 1, any r-sparse unit 
norm v, 11B — B vv^ ||p > (l-ii)||B — Bi||p, or the explained variance (the variance in the residual) is B’^ B v 
which is upper-bormded by 

v"B"Bv< ||Bi||^-i||B-Bi||^ 
r 
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Our algorithm achieves an explained variance which is ||Bi||p — n(i)||B — Bi||p. Note that with respect to 
explained variance, the input matrix A = [1,1,..., 1] immediately gives an upper bound on the approxi¬ 
mation ratio of (1 — r/d) for fhe top r-sparse PC A, since every subset of r columns of A has specfral norm 
r whereas fhe spectral norm of A is d. 

2.4 Iterative Sparse Linear Encoders 

Our previous result is strong in the sense that every vector (column) in the linear encoder is r-sparse with 
non-zero loadings on the same set of r original feature dimensions. From our lower bound, we know that 
the combined sparsity of H must be atleastfc/eto achieve reconstruction error (l-|-e)||X — Xfe|| (in the worst 
case), so our algorithm from Secfion l2l2l is optimal in that sense. The algorithm is a "batch" algorithm in the 
sense that, given k, it constructs all the k factors in H simultaneously. We will refer to this algorithm as the 
"batch algorithm". The batch algorithm may have non-zero loadings on all the r non-zero rows for every 
feafure vecfor (column h^ of H). Further, the batch algorithm does not distinguish between the fc-factors. 
That is, there is no top component, second component, and so on. 

The traditional techniques for sparse PCA consfrucf fhe facfors iteratively. We can run our batch algo¬ 
rithm in an iterative mode, where in each step we set k = 1 and compute a sparse factor for a residual 
matrix. By constructing our k features iteratively (and adaptively), we identify an ordering among the 
k features. Further, we might be able to get each feature sparser while still maintaining a bound on the 
combined sparsity. The problem is that a straightforward reduction of the iterative algorithm to CSSP is 
not possible. So we need to develop a new iterative version of fhe algorifhm for which we can prove an 
approximation guarantee. The iterative algorithm constructs each factor in the encoder sequentially, based 
on the residual from fhe reconsfrucfion using the previously constructed factors. One advantage of the 
iterative algorithm is that it can control the sparsity of each factor independently to achieve the desired 
approximation bound. Recall that the encoder H = [hi,..., h^] is (ri,..., rfc)-sparse if ||hi||g < r^. The 
iferafive algorifhm is summarized below. 


Iterative Sparse Linear Encoder Algorithm 

Inputs: X G target rank k < rank(X); sparsity parameters ri,..., r^. 

Output: (ri,..., rfc)-sparse linear encoder 

1 

Set the residual A = X and H = [ ]. 

2 

for i = 1 to fc do 

3 

Compute an encoder h for A using the batch algorithm with k = 1 and sparsity 
parameter r = rt. 

4 

Update the encoder by adding h to it: H -(— [H, h]. 

5 

Update the residual A: A ^ X - XH(XH)lX. 

6 

end for 

7 

Return the (ri,..., 7’fe)-sparse encoder H G 


The main iferafive sfep in the algorithm is occuring in steps 3,4 above where we augment H by computing a 
top sparse encoder for the residual obtained from the current H. The next lemma bounds the reconstruction 
error for this iterative step in the algorithm. 
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Lemma 10. Suppose, for k >1, = [hi,, h^] is an encoder for X, satisfying 


||X-XHfc(XHfe)tX||p = err. 


Given a sparsity r > 5, one can compute in time 0{ndr + {n + d)r‘^) an r-sparse feature vector hfc+i such that for 
the encoder Hfc+i = [hi,..., h^, h^+i], the reconstruction error satisfies 


E 


||X-XHfc+i(XHfc+i)tX||F 


(1 + 5)(err - ||X - XHfe(XHfc)tX|| 2 ), 


where <5 < 5/(r — 5). 


Proof Let Gfc = (XHfc)^X and B = X — XH^Gfe. We have that ||B[|p = err. Run our batch algorirhm from 
Theorem[7]on B with k = 1 and sparsity parameter r to obtain an r-sparse encoder hfc+i and corresponding 
decoder g^+i satisfying 


E 


||B — Bhfc+igfc+i||p 


(1 + J)||B-Bi||^, 


with 6 < 5/(r - 5). Now, the RHS is ||B - Bi|[p = |[B||p - ||B ||2 = err - ||S|| 2 . The LHS is 


E 


||B - Bhfe+igfc+illp 


||X - XHfeGfe - Xhfc+ig^+i + XHfeGfchfe+ig^+illp 
||X - XHfc(Gfc - Gfehfc+ig^+i) - Xhfc+ig^+ 

||X-XHfc+iGfc+i||^ 




where Gfc+i = [G^ — gfc+ih^_^^Gfc, gfc+i]. That is, Hfc+i with the decoder G^+i satisfies the expected error 
boimd in the lemma, so the optimal decoder cannot do worse in expectation. ■ 


We note that the iterative algorithm produces an encoder H which is not guaranteed to be orthonormal. 
This is not critical for us, since the information loss of the encoder is based on the minimum possible 
reconstruction error, and we can compute this information loss even if the encoder is not orthonormal. 
Observe also that, if so desired, it is possible to orthonormalize the columns of H without changing the 
combined sparsity. 

Lemma [To] gives a bound on the reconstruction error for an iterative addition of the next sparse encoder 
vector. As an example of how we apply Lemma ITOl suppose the target rank is fc = 2. We start by construct¬ 
ing hi with sparsity ri = 5 + 5/e, which gives us (1 -f e)||X — Xi |[p reconstruction error. We now construct 
h 2 , also with sparsity r 2 = 5 + 5/e. The final reconstruction error for H = [hi, h 2 ] is bounded by 


||X - XH(XH)tX||p < (1 + e)2||X - X2||^ + e(l + e)||X - Xi||^ 


On the other hand, our batch algorithm uses sparsity r = 10 -|- 10/e in each encoder hi, h 2 and achieves 
reconstruction error (1 -I- e)||X — X 2 ||p. The iterative algorithm uses sparser features, but pays for it a little 
in reconstruction error. The additive term is small: it is 0{e) and depends on ||X — Xi |[2 = cr|, which 
in practice is smaller than ||X — X 2 ||p = + ■ ■ ■ + In practice, though the theoretical bound for the 

iterative algorithm is slightly worse than the batch algorithm guarantee, the iterative algorithm performs 
comparably to the batch algorithm, it is more flexible and able to generate encoder vectors that are sparser 
than the batch algorithm. 

Using the iterative algorithm, we can tailor the sparsity of each encoder vecfor separately to achieve a 
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desired accuracy. It is algebraically intense to prove a bound for a general choice of the sparsity parameters 
ri,..., rfc, so, for simplicity, we prove a bound for a specific choice of the sparsity parameters which slowly 
increase for each additional encoder vector. We get the following result. 

Theorem 11 (Adaptive iterative encoder). Given X G of rank p and k < p, there is an algorithm to compute 

encoder vectors hi, h 2 ,..., h^ iteratively ivith each encoder vector hj having sparsity = 5 + [ 5 j/e ] such that for 
every 1= I,... ,k, the encoder = [hi, h 2 ,..., h^j has information loss: 


E 


|X-XHf(XH^)tX||F <iei) 


||X-X,||^+££i+^||X,-Xi 


1^ 


(3) 


The running time to compute all the encoder vectors is 0{ndk^e ^ 


{n + d)k^e ^). 


Comments. 

1. {eiy is a very slowly growing in For example, for e = 0.01 and £ = 100, ~ 1.04. Asymptotically, 

{eiy = l+0(elog £),so up toasmalladditiveterm,wehavearelativeerror approximation. The message 
is that we get a reasonable approximation guarantee for our iterative algorithm, where no such bounds 
are available for existing algorithms (which are all iterative). 

2. Observe that each successive encoder vector has a larger value of the sparsity parameter Vj. In the batch 
algorithm, every encoder vector has sparsity parameter r = 5A: + Sfc/e to get a reconstruction error of 
(1 + e)||X — Xfcjjp. This means that the number of non-zeros in H is at most 5/c^ -I- 5fc^/e. For the iterative 
encoder, the first few encoder vectors are very sparse, getting denser as you add more encoder vectors 
until you get to the last encoder vector which has maximum sparsity parameter Vk = 5-1-5/c/e. The 
tradeoff is in the reconstruction error. We can get still sparser in the iterative algorithm, setting every 
encoder vector's sparsity to Vj = 5-1- 5/c/e. In this case, the bound on the reconstruction error for Hi 
becomes (1 -I- e)^||X — X^||p -I- ie(l -I- e)^“^£||X^ — Xi||p, which is growing as 1 -I- 0{ei) as opposed to 
1 + 0{e\ogi). One other trade off with the iterative algorithm is that the combined sparsity (number 
of non-zero rows) of H could increase, as compared to the hatch algorithm. In the batch algorithm, the 
combined sparsity is 0{k/e), the same as the sparsity parameter of each encoder vector, since every 
encoder vector has non-zeros in the same set of rows. With the iterative algorithm, every encoder vector 
could conceivably have non-zeros in different rows giving 0(/c^/e) non-zero rows. 

3. Just as with the PCA vectors vi, V 2 ,... „ we have an encoder for any choice of i by taking the first i 
encoder vectors hi,..., h^. This is not the case for the batch algorithm. If we compute the batch-encoder 
H = [hi,..., hfc], we carmot guarantee that the first encoder vector hi will give a good reconstruction 
comparable with Xi. 

Proof. (Theorem [TO For i > 1, we define two quantities Qf, Pi for that will be useful in the proof. 

Qe = (1-I-e)(l-I-ie)(l-I-ie)(l-f |;e) • • • (1-I-je); 

Pe = (1 + £)(1 + • • • (1 + js ) — 1 = — 1 . 

Using Lemma [To] and induction, we prove a bound on the information loss of encoder 


E 


X - XH,(XH,)tX|[pl < Qi\\X - X,||^ + 


i=2 


2 ^j-1 


(*) 
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When£= 1, the claim is that E[||X — XHi(XHi)’fX||p] < (1 + e)||X — Xi||p (since the summation is empty), 
which is true by construction of Hi = [hi] because ri = 5 + 5/e. Suppose the claim in eq:proofl holds up to 
> 1 and consider H^+i = [H^, h^+i], where h^+i has sparsity rg+i = 5 + + l)/e. We apply Lemma [TOl 

with i5 = e/(£ +1) and we condition on err = jjX — XH^(XH£)lX||| whose expectation is given in eq:proofl. 
By iterated expectation, we have that 


(a) 


E 


||X-XH,+i(XH,+i)tX||; 


= EH,Eh,+, ||X-XH,+i(XH,+i)tX||p I H, 


(b) , 

< 1 + 


Eh, 


|X - XH£(XH,)tX||p - jjX - XH,(XHf)tX||^ 


Qi+i 


£ + 1 ' """ 

Q,\\X - X,||^ + ^ ||X - XH,(XH,)+X|| 


V 


< 


Qt+i 

Qi 

Qi+i 

Qe 

Qe+i 

Qe 


Q,||X-X,||^-n/+i+Q,^n| 


2 -Pj-1 


3=2 


Qj-1 


Qe (a,\i + ||X-X,+i||^) 


h+i 


Qey^/ 

3=2 


Q^IIX — Xf_|_i lip + cr/_|_i((5^ — ^) + Qe yy 


2 -Pj -1 

Q.-i 


P3-1 


2^3 


3=2 


Qj-1 


Q,+i||X-X,+i||^ + n/+i 


2 , _2 Qe+iPe 


+Qe+i yy 


2 Pj-i 


3=2 


^Q 


j-i 


^+1 


— Qe+i ||X — X£_|_i lip + Qe+i gj 


2 P3-1 


3=2 


Qj-1 


In (a) we used iterated expectation. In (b) we used Lemma [TOl to take the expectation over h^+i. In (c), 
we used the definition of Qt from which Qe+ijQe = (l + e/(£+l)), and the induction hypothesis to 
bound Eh^ [||X — XH^(XH^)lX||p]. We also observed that, since XH^(XH£)lX is a rank-fc approximation to 
X, it follows from the Eckart-Yormg theorem that ||X — XHf(XH^)lX ||2 > cr^_|_i for any H^, and hence this 
inequality also holds in expectation. In (d) we used the definition Pi = Qi — 1. The bound eq:proofl now 
follows by induction for ^ > 1. The first term in the bound eqthm:mainlterative follows by bounding Qi 
using elementary calculus: 


e e 

log^ log (^1 + I) < ^ - < elog(e^), 

i=l * i=l * 

where we used log(l + x) < x for a; > 0 and the well known upper bound log(e^) for the £th harmonic 
number l + ^ + jH-tj- Thus, Qg < (e£)^. The rest of the proof is to bound the second term in eq:proofl 
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to obtain the second term in eqthm:mainlterative. Obeserve that for i>l, 


n 1 I 1 ^ I /n 1 I r> 

Pi — Qi — ^ — S-p:. -h p:. -1 < £p^ -h Qi-l — 1 — £p=. -h Pi-lj 

Wl Wl b/l Wl 

where we used QijQi < Qi-i and we define Pq = 0. Therefore, 


E' 

i=2 


2 Pj-l 
^ Qj-1 




j=2 


i=3 


2 Pj-2 
^ Qj-1 


Pj — 2 Qj-2 


2^3 


^ Qj-2 Qj-1 


(a) 

< 




2 -Pj-2 


2 , _2 ^J-2 




■||X£-Xi||^ + ^u 


3=3 


Q3-2 


In (a) we used Qj- 2 IQj-i < 1- The previous derivation gives a reduction from which it is now an elemen¬ 
tary task to prove by induction that 


i-i 


i=2 j=l 


2 

J\\F- 


Since IjX^ - Xj\\l < \\Xi - Xi||p(£ - j)/(£ - 1), we have that 


2P3-3 / £||X^ — Xillp ^ , e£||Xf—Xillp 

E 77 —^ ^577737^ E ^ J = 




2Qi 


Using eq:proofl, we have that 


||X - XH,(XH,)tX||p < (e£)^||X - X,||^ + ^ 

The result finally follows because 

t i 

log^ = E^log (^1 -h < ey] i < e(log(eg - 1) = elog£, 


i=2 


2 = 2 


and so QijQx < 


3 Experiments 

We compare the empirical performance of our algorithms with some of the existing state-of-the-art sparse 
PC A methods. The inputs are X g the number of components k and the sparsity parameter r. The 

output is the sparse encoder H = [hi, h 2 ,..., h^] g with ||hi||p < r; H is used to project X onto some 
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subspace to obtain a reconstruction X which decomposes the variance into two terms: 


|x||f = 


X-X 


X 


= Reconstruction Error + Explained Variance 


Previous methods construct the matrix H to have orthonormal columns and restristed the resonstruction to 

, 2 

XEIH^ . Minimizing the reconstruction error is equivalent to maximizing 

F 

the explained variance, so one metric that we consider is the (normalized) explained variance of the symmetric 


symmetric auto-encoders, X = 
the explainer 
auto-encoder, 

Symmetric Explained Variance = 


XHH^ 


llXfcll^ 


< 1 


To capture how informative the sparse components are, we use the normalized information loss: 


Information Loss = 


|X-XH(XH)tX| 


IIX-X, 


> 1 . 


/clip 


The true explained variance when using the optimal decoder will be larger than the symmetric explained 

2 2 

variance because X — XHEI^ > ||X — XH(XH)lX||p. We report the symmetric explained variance pri- 
marily for historical reasons because existing sparse PCA methods have constructed auto-encoders to opti¬ 
mize the symmetric explained variance rather than the true explained variance. 


3.1 Algorithms 

We implemented the following variant of the sparse PCA algorithm of Theorem [T] in the general frame¬ 
work of the algorithm described in Section |2j2l we use the deterministic technique described in part (i) 
in Theorem |6] in order to find the matrix C with r columns of X. (This algorifhm gives a constant factor 
approximation, as opposed to the relative error approximation of the algorithm in Theorem [71 but it is de¬ 
terministic and simpler to implement.) We call this the "Batch" sparse linear auto-encoder algorithm. We 
correspondingly implement an "Iterative" version with fixed sparsity r in each principal component. In 
each step of the iterative sparse auto-encoder algorithm we use the above batch algorithm to select one 
principal component with sparsity at most r. 

We compare our sparse auto-encoder algorithms to the following stafe-of-fhe-arf sparse PCA algo¬ 
rithms: 



All those algorithms were designed to operate for the simple k = 1 case (notice that our algorithms handle 
any k wi thout any mod ification); hence, to pick k sparse components, we use the "deflation" method sug¬ 
gested in [Mackey! 1200911 : let's say hi is the result of some method applied to X, then h 2 is the result of the 
same method applied to (I„ — hih]) • X • (I„ — hih]), etc. 
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Figure 1: Performance of fhe sparse encoder algorithms on the PitProps data (left). Lymphoma data (mid¬ 
dle) and Colon data (right) data: The figures show the Information loss (top) and symmetric explained 
variance (bottom) with k = 2. We can observe that our algorithms give the best information loss which 
appears to be decreasing inversely with r as the theory predicts. Existing sparse PCA algorithms which 
maximize symmetric explained variance, not surprisingly, perform better with respect to symmetric ex¬ 
plained variance. The figures highlight that information loss and symmetric explained variance are quite 
different metrics. We argue that information loss is the meaningful criterion to optimize. 


3.2 Environment, implementations, datasets 



Matlab 8.4.0.150421 (R2014b) in a Macbook machine with 2.6 GHz Intel Core i7 processor and 16 GB of 
RAM. 

Following existing lite rature, we test the above algorithms in the following three datasets (all available 
Yuan and Zhang 12013 11: 1) PitProps: Here, X £ with n = 13 corresponds t o a correlatio n matrix of 


180 observations measured with 13 vari ables. The origina l dataset is described in letters il967ll . 2) Colon: 


This is the gene-expression d ataset from 


Alon et al. 


119990; here, X G jgsooxsoo 3 ^ Lymphoma: This is the 


gene-expression dataset from Alizadeh et al. I i200ol] : here, X G ^ 500 x 500 phose are PSD matrices, hence fit 


the sparse PCA framework we discussed above. 


3.3 Results 

For each dataset, we tried different k and the sparsity r. The qualitative results for different k are similar 
so we only show results for k = 2. We report the results in Figure [l] Notice that we plot the symmetric 
explained variance and the information loss versus the “average column sparsity" of H. This is because 
all algorithms - with TPower being an exception - cannot guarantee column sparsity in H of exactly r, for 
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given r. Our methods, for example, promise column sparsity at most r. The GPower methods control the 
sparsity level through a real parameter 7 which takes values in (0,1). An exact relation between 7 and r 
is not specified, hence we experimented with different values of 7 in order to achieve different levels of 
sparsity 

We show example sparse encoders H = [hi, h 2 ] for the 5 algorithms with k = 2 and r = 5 below 


Batch 

hi 

h2 

Iter. 

hi h2 

hi 

TP 

h2 

GP-£o 

hi h2 

GP-^i 

hi h2 

0 

0 

0 

0 

0.5 

0 

0.7 

0 

0.6 

0 

0.8 - 

0.3 

-0.6 -0.8 

0.5 

0 

0.7 

0 

0.6 

0 

0 

0 

0 

-0.4 

0 

0.6 

0 

0.7 

0 

0.7 

0 - 

0.8 

0 

-0.2 

0 

0.6 

0 

0.7 

0 

0.7 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0.3 

0 

0 

0 

0 

0 

0 

-0.7 

-0.1 

0.4 

0 

0 

0 

0 

0 

0.3 

0.3 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

-0.3 

0 

0.4 

0 

0 

0 

0.5 

0 

0 

0 

-0.1 

0 

0.4 

-0.2 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

-0.2 

0 

0.3 

0 

0 

0 

0 

0.5 - 

0.4 

0 

0 

0 

0 

0 

0 

0 

0 


The sparse encoder vectors differ significantly among the methods. This hints at how sensitive the optimal 
solution is, which underlines why it is important to optimize the correct loss criterion. Since one goal of 
low-dimensional feature construction is to preserve as much information as possible, the information loss 
is the compeling metric. 

Our algorithms give better information loss than existing sparse PCA approaches. However, existing 
approaches have higher symmetric explained variance. This is in general true across all three datasets and 
different values of k. These findings shouldn't come at a surprise, since previous methods aim at optimizing 
symmetric explained variance and our methods choose an encoder which optimizes information loss. The 
figures highlight once again how different the solutions can be. It also appears that our "iterative" algorithm 
gives better empirical information loss compared to the batch algorithm, for comparable levels of average 
sparsity, despite having a worse theoretical guarantee. 

Finally, we mention that we have not attempted to optimize the running times (theoretically or empir¬ 
ically) of our algorithms. Faster versions of our proposed algorithms might give as accurate results as the 
versions we have implemented but with considerably faster running times; for example, in the iterative 
algorithm (which calls the CSSP algorithm with k = 1), it should be possible significantly speed up the 
generic algorithm (for arbitrary k) to a specialized one for k = 1. We leave such implementation optimiza¬ 
tions for future work. 
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4 Discussion 


Historically, sparse PCA has meant cardinality constrained variance maximization. Variance per se does 
not have any intrinsic value, and it is not easy to generalize to arbitrary encoders which are either not 
orthogonal or not decorrelated. However, the information loss is a natural criterion to optimize because it 
directly reflects how good the features are at preserving the data. Information loss captures the machine 
learning goal when reducing the dimension: preserve as much information as possible. 

We have given efficient asymptotically optimal sparse linear encoders An interesting open question is 
whether one can get a (1 + e)-relative error with respect to information loss for the iterative encoder. We 
believe the answer is yes as is evidenced by the empirical performance of the iterative encoder. 

Acknowledgments. We thank Dimitris Papailiopoulos for pointing out the cormection between MAX- 
CLIQUE and sparse PCA. 
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